@bjlittle
Member

@bjlittle bjlittle commented Feb 27, 2017

Support for streamed netCDF saving using dask.

Introduces the convenience function iris._lazy_data.convert_nans_array to deal with converting a NaN array to either a masked array or a filled ndarray.
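
For illustration, the two target forms can be sketched with plain numpy. This is a minimal sketch and not the body of `convert_nans_array`; the sample data and the fill value of -999.0 are assumptions made only for the example.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical result of computing a lazy array, with NaN marking missing points.
array = np.array([[1.0, np.nan], [3.0, 4.0]])
mask = np.isnan(array)

# Option 1: a masked array, with the NaN points masked.
masked = ma.masked_array(array, mask=mask)

# Option 2: a plain ndarray, with the NaN points replaced by a fill value.
fill_value = -999.0  # assumed value, for illustration only
filled = np.where(mask, fill_value, array)
```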

@bjlittle
Member Author

bjlittle commented Mar 8, 2017

@pp-mo I assume that I now just re-target this PR against dask branch?

@bjlittle bjlittle changed the base branch from dask_timed to dask March 8, 2017 12:05
@bjlittle bjlittle reopened this Mar 8, 2017
else:
array = ma.masked_array(array, mask=mask,
fill_value=fill_value)
return array
Member Author

@bjlittle bjlittle Mar 9, 2017

@pp-mo I've combined my array_nans_to_filled and array_nans_to_masked and opted to use the filled kwarg, rather than have separate but very similar functions.

Interested in your take on my raising an exception for the filled=True case when fill_value=None ... seems the right thing to do.

Also, this routine will perform dtype casting ... again, it seems to me like the best place to do it, but I'm not sure whether having that capability wrapped up within array_nans_to_masked is appropriate for a higher-level function that calls it, such as your as_concrete_data.

Member

Again, mine is very very similar here.
Seems like a case of "furious agreement"

    self._numpy_array = array_nans_to_masked(data,
                                             self.fill_value,
                                             self.dtype)
    self.dtype = None
Member Author

@pp-mo I'm expecting you to have stomped over this with your PR, but hopefully they align in logic ...

Member

hopefully they align in logic

Dead right, mine is spookily similar !

@pp-mo pp-mo self-requested a review March 9, 2017 12:21
Member

@pp-mo pp-mo left a comment

Not much to argue over here, so check it out

    # Finally, mask or fill the data, as required.
    if np.any(mask):
        if filled:
            if fill_value is None:
Member

This doesn't actually work...

    if filled:
        if fill_value is None:
            ...

I think it needs to be

    if filled is not False:
        if filled is None:

    # First, calculate the mask.
    mask = np.isnan(array)
    # Now, cast the dtype, if required.
    if dtype is not None and dtype != array.dtype:
Member

I think for efficiency's sake (aargh!!) this is better done after filling -- we don't want to do a masked 'astype' when it isn't needed.
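
A minimal sketch of that ordering for the filled case, assuming a float input and an int32 target (both chosen only for illustration): fill first, then cast a plain ndarray, so no masked `astype` is involved.

```python
import numpy as np

array = np.array([1.5, np.nan, 3.5])
mask = np.isnan(array)

# Fill the missing points first (in-place, on a plain ndarray) ...
array[mask] = -1.0  # assumed fill value, for illustration only

# ... then cast once; astype on a float ndarray truncates towards zero.
result = array.astype(np.int32)
```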

    if np.any(mask):
        if filled:
            if fill_value is None:
                emsg = 'Invalid fill value, got {!r}.'
Member

As-is, there's not much point in

    if fill_value is None:
        'Invalid fill value, got {!r}.'

It might as well say the exact truth.
E.G. "Dask result contains missing data, but no fill value was provided."

return array


def array_nans_to_masked(array, fill_value=None, dtype=None, filled=False):
Member

Name probably needs fixing ?


    def array_nans_to_masked(array, fill_value=None, dtype=None, filled=False):
        """
        Convert an array into a masked array, by masking any NaN points.
Member

Needs fix: it does not always return a masked array.
(nor should it, IMHO)

@pp-mo
Member

pp-mo commented Mar 9, 2017

I think there are three useful use cases

  • if any missing points then return a masked array (optional : can we set a fill_value ?, do I care ?)
  • fill any missing points with a given fill_value
  • raise error if there are any missing points

(+ in future we might consider a "don't even check for NaNs" option for performance-only-reasons)

@pp-mo
Member

pp-mo commented Mar 9, 2017

Current + possible future call signatures for the above use cases
If array_nans_to_masked() = a-n-t-m() ...

  • "mask-nans" = "if any missing points then return a masked array"
  • "fill-nans" = "fill any missing points with a given fill_value"
  • "no-nans" = "raise error if there are any missing points"

current calls

  • mask-nans : a-n-t-m([fill_value=fv])
  • fill-nans : a-n-t-m(filled=True, fill_value=fv)
  • no-nans : a-n-t-m(filled=True [, fill_value=None])

future ideas ?
How about a-n-t-m(nans='mask', fill_value=None)

  • mask-nans : a-n-t-m([nans='mask'])
  • fill-nans : a-n-t-m(nans='fill', fill_value=fv)
  • no-nans : a-n-t-m(nans=None)

The above still allows you to specify the fill_value of a masked array it creates.
If that isn't needed then a single key can do it all.
How about...
a-n-t-m(nans=None)

  • no-nans : a-n-t-m()
  • mask-nans : a-n-t-m(nans=np.ma.masked)
  • fill-nans : a-n-t-m(nans=fill_value)
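
A rough sketch of how that last single-keyword form could behave. The function name `convert_nans` and the error message are hypothetical, and this is not the code in the PR; it only illustrates the three modes selected by one argument.

```python
import numpy as np
import numpy.ma as ma


def convert_nans(array, nans=None):
    """Hypothetical sketch of the single-keyword idea (not the PR's code)."""
    mask = np.isnan(array)
    if not mask.any():
        # Nothing is missing, so all three modes return the input as-is.
        return array
    if nans is None:
        # "no-nans": missing points are an error.
        raise ValueError('Array contains NaNs, but no replacement was given.')
    if nans is ma.masked:
        # "mask-nans": return a masked array with numpy's default fill_value.
        return ma.masked_array(array, mask=mask)
    # "fill-nans": overwrite the missing points in-place with the value given.
    array[mask] = nans
    return array


data = np.array([1.0, np.nan, 3.0])
masked = convert_nans(data.copy(), nans=ma.masked)
filled = convert_nans(data.copy(), nans=-999.0)
```

The identity test against `ma.masked` works because it is a singleton, so a single argument can carry "mask", "fill with this value" or "no NaNs allowed" without a second flag.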

@bjlittle
Member Author

bjlittle commented Mar 9, 2017

@pp-mo

if any missing points then return a masked array (optional : can we set a fill_value ?, do I care ?)

Agreed use case, interesting part is whether we care to set the fill_value or just opt to use the default fill_value of the dtype as decreed by numpy. We still need the cube.fill_value as that tells the savers what to do, specific to their fileformat - and anyways the user can change that if they care.

If a user plucks a concrete masked data payload from the cube, for whatever reason, then they can set the fill_value to be what they care about, or to the cube.fill_value if that matters to them.

I guess I'm quickly coming to the conclusion that I can't see a use case for why we would want to set the fill_value on the masked data of a cube ... am I missing something obvious here?

fill any missing points with a given fill_value

Yup ... and fill them with the fill_value that is specified ... and this can only be non-None

raise error if there are any missing points

@pp-mo I'm struggling to see when this might be required ... are you thinking of points and bounds of coordinates here? Can you elaborate?

(+ in future we might consider a "don't even check for NaNs" option for performance-only-reasons)

Yup, but not now, right? That's an optimisation that I'd love for us to be in that "happy place" to make, but not just right now.

@pp-mo
Member

pp-mo commented Mar 9, 2017

raise error if there are any missing points

@pp-mo I'm struggling to see when this might be required ... are you thinking of points and bounds of coordinates here? Can you elaborate?

Yes, that's exactly what I'm thinking of.

@bjlittle bjlittle force-pushed the dask-netcdf-save branch 3 times, most recently from 237e124 to 6d2f381 Compare March 16, 2017 02:39
@bjlittle
Member Author

Ping @pp-mo @lbdreyer @dkillick 😄

    # masked result, but it ensures we use a "filled" version of the
    # input in this case.
    if cube.fill_value is not None:
        source_data.fill_value = cube.fill_value
Member Author

@bjlittle bjlittle Mar 16, 2017

The way trajectory handles masks is pretty ropey ... this change appears appropriate.

The preceding inline code comment says it all really ...

@bjlittle bjlittle requested review from DPeterK and lbdreyer March 16, 2017 11:13
@bjlittle
Member Author

@pp-mo @lbdreyer @dkillick someone should own this PR and drive the review forward ... tick, tock goes the clock.

    # Check the fill value is appropriate for the
    # target result dtype.
    try:
        [fill_value] = np.asarray([nans], dtype=result_dtype)
Member Author

@bjlittle bjlittle Mar 16, 2017

This should use the dtype of the array not the result_dtype ...

        [fill_value] = np.asarray([nans], dtype=result_dtype)
    except OverflowError:
        emsg = 'Fill value of {!r} invalid for result {!r}.'
        raise ValueError(emsg.format(nans, result_dtype))
Member Author

Again, this should use array.dtype not result_dtype
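
To illustrate checking a fill value against the dtype of the array it will be written into, here is a sketch that compares against the dtype's limits explicitly instead of relying on `np.asarray` raising `OverflowError`, since (as far as I know) how out-of-range values are handled there can depend on the value type and the numpy version. The helper name `check_fill_value` is hypothetical and is not part of this PR.

```python
import numpy as np


def check_fill_value(fill_value, dtype):
    """Hypothetical helper: raise ValueError if fill_value is out of range."""
    dtype = np.dtype(dtype)
    if dtype.kind in 'iu':
        limits = np.iinfo(dtype)
    elif dtype.kind == 'f':
        limits = np.finfo(dtype)
    else:
        raise TypeError('Unsupported dtype {!r}.'.format(dtype))
    if not (limits.min <= fill_value <= limits.max):
        emsg = 'Fill value of {!r} invalid for array {!r}.'
        raise ValueError(emsg.format(fill_value, dtype))


array = np.array([1.0, np.nan], dtype=np.float32)
# Validate against the dtype of the array being filled, not the result dtype.
check_fill_value(666.0, array.dtype)
```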

@bjlittle
Member Author

bjlittle commented Mar 16, 2017

Downloads to travis seem to be getting throttled at the moment ... which is painful 😱

    * nans:
        If `nans` is None, then raise an exception if the `array` contains
        any NaN values (default).
Member

nans is None is never used so why did you include?

Member Author

It's the default in the function formal parameter list.

Also, it's a very useful value to default to because it's gonna raise an exception if it detects that there are NaNs in your data. We're forcing people to care about this. So for example, this is good behaviour to adopt as it will force the issue in cases when streamed saving is being attempted with NaN data, but the fill_value isn't set i.e. None. See here in particular.

Member

nans is None is never used

But it is here ?

Member

But it is here ?

I meant when convert_nans_array is used, but I forgot that the fill_value of a cube could be set to None, as @bjlittle's comment points out.

    * nans:
        If `nans` is None, then raise an exception if the `array` contains
        any NaN values (default).
Member

I don't think nans is descriptive of the purpose of this keyword argument. Perhaps nans_replacement would be better?

Member Author

My take on it is: what are my NaNs going to be replaced with?

  • nans=fill_value is replace NaNs with the fill_value
  • nans=ma.masked is replace my NaNs with a mask
  • nans=None is I don't care, there shouldn't be any NaNs in my data, barf if there are

I'm open to a group take on this, as I've been staring at it way too long ... but I'm kinda wedded to the short and sweet nans name ... thoughts?

Member Author

Okay, that's two votes for change ... that's enough for me ... nans_replacement it is!

Member

nans_substitute is one character shorter if you prefer that

Member Author

Thanks. First impressions last and nans_replacement is perfect.

        raise ValueError(emsg)
    elif nans is ma.masked:
        # Mask the array with the default fill_value.
        array = ma.masked_array(array, mask=mask)
Member

So, this uses the default fill_value irrespective of what the fill value on the cube is.

Member Author

Exactly. We no longer care. The fill_value that we do care about lives on the cube.
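
A small sketch of what that means in practice, using plain numpy. The name `cube_fill_value` is a hypothetical stand-in for `cube.fill_value`, and -999.0 is an assumed value.

```python
import numpy as np
import numpy.ma as ma

array = np.array([1.0, np.nan, 3.0])
masked = ma.masked_array(array, mask=np.isnan(array))

# The masked array carries numpy's default fill_value for its dtype ...
default = ma.default_fill_value(array)

# ... while the value that matters is applied explicitly when it is needed,
# for example by a saver.
cube_fill_value = -999.0  # hypothetical stand-in for cube.fill_value
saved = masked.filled(cube_fill_value)
```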

        self.assertEqual(result.fill_value, expected.fill_value)

    def test_nans_filled(self):
        fill_value = 666.0
Member

👿

Member Author

Muh-ha-ha-har!

@pp-mo
Member

pp-mo commented Mar 16, 2017

someone should own this PR and drive the review forward

I'm looking at it now, but I don't claim to be in charge.
I've only got 15 mins now, then busy 1300-1400.

@pp-mo
Member

pp-mo commented Mar 16, 2017

Re:

I don't think nans is descriptive of the purpose of this keyword argument

I think I'm happy with the function of the routine, but I will support better naming of arguments !

@bjlittle
Member Author

@pp-mo @lbdreyer @dkillick Okay guys are there any further objections to this PR?

I've addressed all of @lbdreyer's comments ...

Member

@pp-mo pp-mo left a comment

I'm fine with the changes in existing code.

There are a few awkward corners in the behaviour of the core routine + its testing, which I think at least deserve documenting.

If `nans_replacement` is None, then raise an exception if the `array`
contains any NaN values (default behaviour).
If `nans_replacement` is `numpy.ma.masked`, then convert the `array`
to a :class:`~numpy.ma.core.MaskedArray`.
Member

NOTE: something missing here.
We should explain that in "masked array" mode the input array is copied, but in "fill" or "no-nans" modes, the result is the input array, which may be modified in-place.

may be modified in-place

P.S. except when we do a change of dtype.
A bit slippery that, isn't it 😬 ??
Perhaps should just say "in some cases, the input is modified in-place" + leave it at that ?
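
The fill-mode part of that can be sketched as follows. The function `fill_nans` is hypothetical and only mirrors the fill-then-cast behaviour described here: the result is the input when no change of dtype is requested, and a new array otherwise.

```python
import numpy as np


def fill_nans(array, fill_value, dtype=None):
    """Hypothetical sketch: fill NaNs in-place, then optionally cast."""
    array[np.isnan(array)] = fill_value
    if dtype is not None and np.dtype(dtype) != array.dtype:
        # astype returns a new array, so the result is no longer the input.
        array = array.astype(dtype)
    return array


data = np.array([[1.0, np.nan], [3.0, 4.0]])
same = fill_nans(data, 666.0)                   # the input, modified in-place
cast = fill_nans(data, 666.0, dtype=np.int32)   # a new array
```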

    self.assertIs(result, self.array)
    expected = np.array([[1.0, fill_value],
                         [3.0, 4.0]])
    self.assertArrayEqual(result, expected)
Member

Should also check that result is the input (no copy, fill-in-place)

Member Author

@bjlittle bjlittle Mar 20, 2017

@pp-mo Well, that's already done on line 80 ... 👓

    expected = np.array([[1, fill_value],
                         [3, 4]],
                        dtype=dtype)
    self.assertArrayEqual(result, expected)
Member

In this case, result is not the original (not in-place with new dtype).

P.S. did you know you can assign array.dtype ? It does like a C pointer cast 😱 .
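
For anyone unfamiliar with that behaviour, `ndarray.view` performs the same kind of reinterpretation without assigning to the attribute. A small sketch contrasting it with `astype`, using float32 and int32 because they share an item size:

```python
import numpy as np

array = np.array([1.0], dtype=np.float32)

# astype converts the values: 1.0 becomes the integer 1.
converted = array.astype(np.int32)

# view reinterprets the same bytes: the IEEE 754 bit pattern of 1.0.
reinterpreted = array.view(np.int32)
```

Here `reinterpreted` shares memory with `array`, whereas `converted` is a separate array.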

        self.assertIsInstance(result, ma.MaskedArray)
        self.assertIs(result, array)

    def test_pass_thru_masked_array_integer(self):
Member

Unmasked ints are pass-through anyway.
Should be checking that.
It's in the docstring!

Member Author

@pp-mo Yup, good spot. Done!

        fill value.
    * result_dtype:
        Cast the resultant array to this target :class:`~numpy.dtype`.
Member

This will not happen with pass-through cases (anything already masked, or integer types).
We should explain that here, somehow.

Member Author

@pp-mo This is actually mentioned in the doc-string note. I think that is sufficient.

        result = array.data.copy()
        result[mask] = np.nan
    else:
        result = array.data
Member Author

@lbdreyer This addresses the following numpy warning:

MaskedArrayFutureWarning: setting an item on a masked array which has a shared mask will not copy the mask and also change the original mask array in the future.
Check the NumPy 1.11 release notes for more information.

Which was generated by array[mask] = np.nan whenever the data was masked, but the dtype was not integral.
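
As a self-contained sketch of the approach in the snippet above (the function name `masked_to_nans` and the sample data are hypothetical): the NaNs are assigned on a copy of the underlying ndarray, so no item is ever set on the masked array itself.

```python
import numpy as np
import numpy.ma as ma


def masked_to_nans(array):
    """Hypothetical sketch: return a float ndarray with NaN at masked points."""
    mask = ma.getmaskarray(array)
    if mask.any():
        # Copy the underlying data and assign on the plain ndarray, so the
        # masked array (and any mask it shares) is never written to.
        result = array.data.copy()
        result[mask] = np.nan
    else:
        result = array.data
    return result


source = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
result = masked_to_nans(source)
```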

@bjlittle
Member Author

closes #2347

<?xml version="1.0" ?>
<cubes xmlns="urn:x-iris:cubeml-0.2">
<cube standard_name="eastward_wind" units="m s-1" var_name="wind1">
<cube core-dtype="int32" dtype="int32" standard_name="eastward_wind" units="m s-1" var_name="wind1">
Member

At some point we need to take out these core-dtype from the cml.
Perhaps at the end of this sprint (considering we want to target fill_value/dtype/get-tests-working this sprint)

@lbdreyer
Member

All comments have now been addressed so I'm gonna merge this in

@lbdreyer lbdreyer merged commit 8e3ede4 into SciTools:dask Mar 20, 2017
@bjlittle
Member Author

@lbdreyer Awesome, thanks!

@bjlittle bjlittle deleted the dask-netcdf-save branch March 20, 2017 15:27
bjlittle added a commit to bjlittle/iris that referenced this pull request May 31, 2017
* NetCDF save with dask support

* Refactor and use array_nans_to_masked

* Add array_nans_to_masked test coverage.

* Refactor.

* Working tests.

* Update comment detail for convert usage.

* Use array.dtype in convert_nans_array

* Review actions.

* Review actions.
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017